MEDB 5501, Module07

2025-10-07

Topics to be covered

  • What you will learn
    • Categorical independent variables
    • R code for categorical independent variables
    • Multiple linear regression
    • R code for multiple linear regression
    • Diagnostic plots and multicollinearity
    • R code for diagnostic plots and multicollinearity
    • Your homework

Categorical independent variables, 1

  • Regression equation
    • \(Y_i=\beta_0+\beta_1 X_i+\epsilon_i\)
  • How do you modify this if \(X_i\) is categorical?
    • Indicator variables
  • Examples
    • Treatment: active drug=1, placebo=0
    • Second hand smoke: exposed=1, not exposed=0
    • Gender: male=1, female=0
  • To be discussed later: three or more category levels

Categorical independent variables, 2

  • If \(X_i\) = 0
    • \(Y_i=\beta_0 + \beta_1 (0) + \epsilon_i\)
    • \(Y_i=\beta_0 + \epsilon_i\)
  • If \(X_i\) = 1
    • \(Y_i=\beta_0 + \beta_1 (1) + \epsilon_i\)
    • \(Y_i=\beta_0 + \beta_1 + \epsilon_i\)

Categorical independent variables, 3

  • Interpretation
    • \(b_0\) is the estimated average value of Y when X equals the “zero category”
    • \(b_1\) is the estimated average change in Y when X changes from the “zero category” to the “one category.”
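A minimal R sketch with made-up numbers (not the fev data) shows these interpretations directly: the intercept equals the mean of the zero category, and the slope equals the difference between the two category means.

```r
# Hypothetical outcome and 0/1 indicator (illustration only)
y <- c(2.1, 2.5, 2.3, 2.9, 3.1, 3.0)
x <- c(0, 0, 0, 1, 1, 1)
fit <- lm(y ~ x)
b0 <- unname(coef(fit)[1])   # estimated mean of the zero category
b1 <- unname(coef(fit)[2])   # estimated change from zero to one category
b0   # 2.3 = mean(y[x == 0])
b1   # 0.7 = mean(y[x == 1]) - mean(y[x == 0])
```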

Creating an indicator variable

    fev sex sex_male
1 1.708   F        0
2 1.724   F        0
3 1.720   F        0
4 1.558   M        1
5 1.895   M        1
6 2.336   F        0
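The sex_male column shown above can be created with ifelse(); a sketch that recreates the six rows displayed:

```r
# Rebuild the six rows shown above, then derive the 0/1 indicator
fev_a <- data.frame(
  fev = c(1.708, 1.724, 1.720, 1.558, 1.895, 2.336),
  sex = c("F", "F", "F", "M", "M", "F")
)
fev_a$sex_male <- ifelse(fev_a$sex == "M", 1, 0)
fev_a
```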

Graphical display using the indicator variable

Linear regression using the indicator variable


Call:
lm(formula = fev ~ sex_male, data = fev_a)

Coefficients:
(Intercept)     sex_male  
     2.4512       0.3613  

The estimated average fev value is 2.45 liters for females. The estimated average fev value is 0.36 liters larger for males.

Graphical display using alternate indicator variable

Linear regression using alternate indicator variable


Call:
lm(formula = fev ~ sex_female, data = fev_b)

Coefficients:
(Intercept)   sex_female  
     2.8124      -0.3613  

Letting your software create the indicator variable

  • Different rules for different software
    • SPSS, SAS: first alphabetical category=1, second=0
    • R: second alphabetical category=1, first=0
  • Always compare your output to the descriptive statistics
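A sketch of both codings in R, assuming a small data frame like the one on the earlier slide: lm() on a factor treats the first alphabetical level as the reference category, and relevel() switches the reference.

```r
fev_a <- data.frame(
  fev = c(1.708, 1.724, 1.720, 1.558, 1.895, 2.336),
  sex = c("F", "F", "F", "M", "M", "F")
)
# Default: "F" comes first alphabetically, so R creates sexM (1 = male)
fit1 <- lm(fev ~ sex, data = fev_a)
# Change the reference category to "M"; R now creates an indicator for "F"
fev_a$sex_ref_m <- relevel(factor(fev_a$sex), ref = "M")
fit2 <- lm(fev ~ sex_ref_m, data = fev_a)
coef(fit1)["sexM"]         # same magnitude, opposite sign...
coef(fit2)["sex_ref_mF"]   # ...as this coefficient
```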

Break #1

  • What you have learned
    • Categorical independent variables
  • What’s coming next
    • R code for categorical independent variables

fev data dictionary

Refer to the data dictionary.

simon-5501-07-fev.qmd

Refer to part 1 of my code.

Break #2

  • What you have learned
    • R code for categorical independent variables
  • What’s coming next
    • Multiple linear regression

Model

  • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\epsilon_i,\ i=1,...,N\)
  • Least squares estimates: \(b_0,\ b_1,\ b_2\)

Interpretations

  • \(b_0\) is the estimated average value of Y when X1 and X2 both equal zero.
  • \(b_1\) is the estimated average change in Y
    • when \(X_1\) increases by one unit, and
    • \(X_2\) is held constant
  • \(b_2\) is the estimated average change in Y
    • when \(X_2\) increases by one unit, and
    • \(X_1\) is held constant
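A hedged sketch with simulated data (not the actual fev dataset) fits this two-predictor model in R:

```r
set.seed(5501)
n <- 100
age <- sample(3:19, n, replace = TRUE)        # hypothetical ages
height <- 40 + 1.8 * age + rnorm(n, sd = 2)   # height rises with age
fev <- -5 + 0.05 * age + 0.10 * height + rnorm(n, sd = 0.4)
fit <- lm(fev ~ age + height)
coef(fit)   # b0; b1 = change per year of age, height held constant;
            # b2 = change per unit of height, age held constant
```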

Unadjusted relationship between height and FEV

Relationship between height and FEV controlling for age (one plot for each age, 3 through 19)

Unadjusted relationship between age and FEV

Relationship between age and FEV controlling for height (one plot for each height band, 46–49.5 through 70–73.5)

Break #3

  • What you have learned
    • Multiple linear regression
  • What’s coming next
    • R code for multiple linear regression

simon-5501-07-fev.qmd

Refer to part 2 of my code.

Break #4

  • What you have learned
    • R code for multiple linear regression
  • What’s coming next
    • Diagnostic plots and multicollinearity

Assumptions

  • Population model
    • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\epsilon_i,\ i=1,...,N\)
  • Assumptions about \(\epsilon_i\)
    • Normal distribution
    • Mean 0
    • Standard deviation \(\sigma\)
    • Independent

Residuals

  • \(\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}\)
  • \(e_i = Y_i - \hat{Y}_i\)
    • Behavior of \(e_i\) helps evaluate assumptions about \(\epsilon_i\)
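In R, fitted() and resid() return \(\hat{Y}_i\) and \(e_i\) from an lm fit; a small made-up sketch:

```r
# Hypothetical data (illustration only)
y  <- c(1.2, 1.8, 2.1, 2.9, 3.4)
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(5, 3, 4, 2, 1)
fit <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)   # Yhat_i = b0 + b1*x1 + b2*x2
e <- resid(fit)        # e_i = y_i - Yhat_i
all.equal(unname(e), y - unname(y_hat))
```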

Assessing normality assumption

  • Normal probability plot
  • Histogram

Assessing heterogeneity, nonlinearity

  • Plot \(e_i\) versus \(\hat{Y}_i\)
    • Composite of \(X_1\) and \(X_2\)
    • Look for differences in variation
    • Look for curved pattern
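Base R can draw all of these diagnostic plots from any lm fit; a sketch using simulated data:

```r
# Simulated fit for illustration
set.seed(5501)
x1 <- rnorm(50); x2 <- rnorm(50)
y <- 1 + 2 * x1 - x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

qqnorm(resid(fit)); qqline(resid(fit))        # normal probability plot
hist(resid(fit))                              # histogram of residuals
plot(fitted(fit), resid(fit)); abline(h = 0)  # look for fan shapes or curvature
```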

Independence is always assessed qualitatively

  • Look at how your data was collected
    • Are there clusters?
      • Observations within a cluster are sometimes correlated
    • Is the outcome affected by proximity?
      • Results for infectious diseases may be correlated
  • Avoid tests of independence
    • Durbin-Watson or runs tests are bad
    • More about this in Biostats-2

Influential values

  • Leverage
    • Compare to \(3(k+1)/n\)
      • k is the number of independent variables; n is the sample size
  • Studentized deleted residual
    • Compare to \(\pm 3\)
  • Cook’s distance
    • Compare to 1
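All three diagnostics come from base R functions; a sketch applying the thresholds above to a simulated fit:

```r
set.seed(5501)
x1 <- rnorm(50); x2 <- rnorm(50)
y <- 1 + 2 * x1 - x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

n <- 50; k <- 2                           # k = number of independent variables
which(hatvalues(fit) > 3 * (k + 1) / n)   # high leverage
which(abs(rstudent(fit)) > 3)             # extreme studentized deleted residuals
which(cooks.distance(fit) > 1)            # large Cook's distance
```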

Leverage, 1

# A tibble: 15 × 9
     fev   age height .fitted  .resid   .hat .sigma  .cooksd .std.resid
   <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
 1  1.16     7   47     0.926  0.239  0.0148  0.420 0.00165       0.574
 2  2.91    18   66     3.61  -0.702  0.0200  0.419 0.0194       -1.69 
 3  5.10    19   72     4.32   0.782  0.0171  0.419 0.0205        1.88 
 4  3.52    19   66     3.66  -0.143  0.0262  0.420 0.00107      -0.345
 5  3.34    19   65.5   3.61  -0.262  0.0274  0.420 0.00376      -0.633
 6  3.08    18   64.5   3.44  -0.361  0.0231  0.420 0.00598      -0.870
 7  2.90    16   63     3.17  -0.267  0.0150  0.420 0.00208      -0.641
 8  4.22    18   68     3.83   0.393  0.0167  0.420 0.00506       0.944
 9  3.5     17   62     3.11   0.386  0.0228  0.420 0.00672       0.929
10  2.61    16   62     3.06  -0.452  0.0170  0.420 0.00679      -1.09 
11  4.09    18   67     3.72   0.369  0.0183  0.420 0.00487       0.887
12  4.40    18   70.5   4.10   0.303  0.0141  0.420 0.00251       0.726
13  2.28    15   60     2.79  -0.508  0.0160  0.420 0.00810      -1.22 
14  2.85    18   60     2.95  -0.0963 0.0359  0.420 0.000678     -0.234
15  2.80    16   63     3.17  -0.375  0.0150  0.420 0.00410      -0.900

Leverage, 2

Studentized residuals, 1

# A tibble: 7 × 9
    fev   age height .fitted .resid    .hat .sigma .cooksd .std.resid
  <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
1  1.72     8   67.5    3.23  -1.51 0.0131   0.416  0.0578      -3.61
2  5.22    12   70      3.72   1.50 0.00637  0.416  0.0276       3.59
3  2.54    14   71      3.94  -1.40 0.00610  0.416  0.0229      -3.35
4  2.22    13   68      3.56  -1.34 0.00377  0.417  0.0129      -3.20
5  5.79    15   69      3.77   2.02 0.00604  0.412  0.0472       4.83
6  5.63    17   73      4.32   1.31 0.0104   0.417  0.0347       3.14
7  5.64    17   70      3.99   1.65 0.0108   0.415  0.0565       3.94

Studentized residuals, 2

Studentized residuals, 3

Cook’s distance

In the pulmonary database, no observation combines high leverage with an extreme studentized residual, so no point causes concern.

Note: interpreting influential values gets tricky with two independent variables.

Multicollinearity

  • Synonyms:
    • Collinearity
    • Ill-conditioning
    • Near collinearity
  • When two independent variables are strongly correlated
  • When a linear combination of three or more variables is nearly constant

Problems caused by multicollinearity

  • Interpretation
    • What does “holding one variable constant” mean?
    • Difficult to disentangle the individual impacts
  • Inflated standard errors
    • Very wide confidence intervals
    • Loss of statistical power
  • Note: multicollinearity is NOT a violation of assumptions

Variance inflation factor (VIF)

  • How much precision is lost due to multicollinearity
  • Values larger than 10 are cause for concern
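The VIF for a predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the other predictors; a sketch in base R with simulated data (the car package's vif() function computes the same quantity):

```r
set.seed(5501)
age <- sample(3:19, 100, replace = TRUE)        # hypothetical data
height <- 40 + 1.8 * age + rnorm(100, sd = 2)   # correlated with age

r2 <- summary(lm(height ~ age))$r.squared  # R^2 of height on the other predictor
vif_height <- 1 / (1 - r2)
vif_height   # values above 10 would be cause for concern
```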

Break #5

  • What you have learned
    • Diagnostic plots and multicollinearity
  • What’s coming next
    • R code for diagnostic plots and multicollinearity

simon-5501-07-fev.qmd

Refer to part 3 of my code.

Break #6

  • What you have learned
    • R code for diagnostic plots and multicollinearity
  • What’s coming next
    • Your homework

simon-5501-07-directions.qmd

Refer to the programming assignment on my GitHub site.

Summary

  • What you have learned
    • Categorical independent variables
    • R code for categorical independent variables
    • Multiple linear regression
    • R code for multiple linear regression
    • Diagnostic plots and multicollinearity
    • R code for diagnostic plots and multicollinearity
    • Your homework